Estimating Prevalence Correctly

Complex Sampling in National Surveys

Mohd Azmi Bin Suliman

Pusat Penyelidikan Penyakit Tak Berjangkit, Institut Kesihatan Umum

Sunday, 16 November 2025

Institute for Public Health (Institut Kesihatan Umum, IKU)

# A tibble: 252 × 5
   date       sex    age   ethnicity         pop_k
   <date>     <chr>  <chr> <chr>             <dbl>
 1 2025-01-01 female 0-4   overall          1114. 
 2 2025-01-01 female 0-4   bumi_malay        706. 
 3 2025-01-01 female 0-4   bumi_other        129. 
 4 2025-01-01 female 0-4   chinese           106. 
 5 2025-01-01 female 0-4   indian             45  
 6 2025-01-01 female 0-4   other_citizen      13.6
 7 2025-01-01 female 0-4   other_noncitizen  115. 
 8 2025-01-01 female 5-9   overall          1232. 
 9 2025-01-01 female 5-9   bumi_malay        729. 
10 2025-01-01 female 5-9   bumi_other        139. 
# ℹ 242 more rows
# A tibble: 36 × 5
   date       sex    age   ethnicity pop_k
   <date>     <chr>  <chr> <chr>     <dbl>
 1 2025-01-01 female 0-4   overall   1114.
 2 2025-01-01 female 5-9   overall   1232.
 3 2025-01-01 female 10-14 overall   1244.
 4 2025-01-01 female 15-19 overall   1289.
 5 2025-01-01 female 20-24 overall   1425.
 6 2025-01-01 female 25-29 overall   1355.
 7 2025-01-01 female 30-34 overall   1296.
 8 2025-01-01 female 35-39 overall   1333.
 9 2025-01-01 female 40-44 overall   1260.
10 2025-01-01 female 45-49 overall   1005.
# ℹ 26 more rows

Who are we?

  • National Health Surveys: Conducts large-scale surveys like NHMS to monitor Malaysia’s population health.
  • Public Health Research: Focuses on the epidemiology of non-communicable diseases, nutrition, and communicable diseases, in the general population and in specific age groups.
  • Policy Support: Provides data-driven evidence to guide national health planning and interventions.

What do we do?

NHMS 2025, Field

NHMS 2025, Parliament

NHMS Reports

https://iku.nih.gov.my/nhms

Samples vs. Population

The Sampling Problem

  • To describe a population, we usually measure a sample rather than everyone in it.

  • Unfortunately, the sample's distribution of characteristics such as gender, ethnicity, and age may differ from that of the population.

  • Small studies typically restrict their sample by clearly defining the target population with inclusion and exclusion criteria.

  • National surveys, including health surveys, instead need a sample that represents the general population of interest (e.g., adults, older persons, mothers and children).
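To make the mismatch concrete, here is a minimal sketch in base R. All numbers are hypothetical and chosen only for illustration (they are not NHMS or DOSM figures): the sample over-represents an older group with higher prevalence, so the naive sample proportion overstates the population prevalence, while reweighting each group by its assumed population share corrects this.

```r
# Assumed population shares by age group (hypothetical, not DOSM figures)
pop_share <- c(young = 0.60, old = 0.40)

# Hypothetical sample of 1,000 respondents that over-represents older adults
sample_n <- c(young = 300, old = 700)

# Hypothetical number of respondents with the condition in each group
cases <- c(young = 15, old = 210)

# Group-specific prevalence observed in the sample: 5% and 30%
prev_group <- cases / sample_n

# Naive estimate: treats the sample as if it mirrored the population
prev_naive <- sum(cases) / sum(sample_n)       # 0.225

# Reweighted estimate: each group counts according to its population share
prev_adjusted <- sum(pop_share * prev_group)   # 0.15

round(c(naive = prev_naive, adjusted = prev_adjusted), 3)
```

The gap between 22.5% and 15% comes entirely from who ended up in the sample, not from any change in the group-level prevalence.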

Malaysian Population

The code

# Load packages: tidyverse for wrangling and plotting, arrow for reading Parquet
pacman::p_load(tidyverse, arrow)

# DOSM population estimates: keep 2025, both sexes, every age group, all ethnicities combined
pyr_df <- read_parquet("https://storage.dosm.gov.my/population/population_malaysia.parquet") %>%
  filter(date == as.Date("2025-01-01"), sex %in% c("male", "female"), 
         age != "overall", ethnicity == "overall") %>%
  mutate(pop_k = population,                           # population in thousands
         pop = if_else(sex == "male", -pop_k, pop_k),  # negative values put males on the left
         age0 = readr::parse_number(age),              # lower bound of each age group
         age = fct_reorder(age, age0))                 # order age groups numerically

# Population pyramid: horizontal bars mirrored around zero
ggplot(pyr_df, aes(x = age, y = pop, fill = sex)) +
  geom_col(width = 0.9) + coord_flip() +
  scale_y_continuous(limits = c(-2000, 2000), breaks = seq(-2000, 2000, 500), 
                     labels = function(x) scales::comma(abs(x)),  # show absolute values on the axis
                     expand = expansion(mult = c(0.02, 0.02))) +
  labs(title = "Malaysia Population Pyramid, 2025", x = "Age group (years)", 
       y = "Population (thousands)", fill = "Sex") +
  theme_minimal(base_size = 13) + theme(panel.grid.minor = element_blank())

Complex Sampling

What is Complex Sampling?

  • Structured selection – Instead of simple random sampling, respondents are chosen through stratified and clustered sampling to ensure representation across diverse groups.

  • Unequal probabilities – Some groups are oversampled (e.g., small states, older adults) to obtain reliable estimates, necessitating the use of sampling weights to correct for these differences.

  • Design-based inference – Analysis must account for the survey’s design, including strata, clusters, and weights, so that prevalence estimates and their standard errors reflect the target population.
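The three ideas above can be sketched on simulated data. Everything here is an assumption for illustration: the data frame `svy`, the column names (`stratum`, `psu`, `wt`, `diabetes`), the stratum sizes, and the prevalence values are hypothetical and do not reflect the actual NHMS design. Weights are taken as population size divided by sample size within each stratum, and the design-based step uses `svydesign()` and `svymean()` from the `survey` package if it is installed.

```r
set.seed(2025)

# Simulated respondents: 2 strata x 4 clusters (PSUs) x 25 people (hypothetical layout)
svy <- expand.grid(id = 1:25, psu = 1:4, stratum = c("urban", "rural"),
                   stringsAsFactors = FALSE)

# Assumed stratum population sizes; both strata contribute 100 respondents,
# so the smaller rural stratum is oversampled
N_h <- c(urban = 8000, rural = 2000)
n_h <- table(svy$stratum)

# Sampling weight = number of people each respondent represents (N_h / n_h)
svy$wt <- unname(N_h[svy$stratum] / as.numeric(n_h[svy$stratum]))

# Simulated outcome with a different (assumed) prevalence in each stratum
svy$diabetes <- rbinom(nrow(svy), 1, ifelse(svy$stratum == "urban", 0.12, 0.20))

# Point estimates: ignoring vs. using the weights
prev_unweighted <- mean(svy$diabetes)
prev_weighted   <- weighted.mean(svy$diabetes, svy$wt)
c(unweighted = prev_unweighted, weighted = prev_weighted)

# Design-based standard error and confidence interval (needs the survey package)
if (requireNamespace("survey", quietly = TRUE)) {
  des <- survey::svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                           data = svy, nest = TRUE)
  est <- survey::svymean(~diabetes, des)
  print(est)
  print(confint(est))
}
```

The weights sum to the assumed population size (10,000), and the `svymean()` point estimate matches `weighted.mean()`; what the design object adds is a standard error that respects the strata and clusters.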

Why Complex Sampling?

  • Sampling: We use a sample to estimate the population efficiently, saving time, cost, and resources while still capturing key characteristics.

  • Stratification: Stratifying (e.g., by gender or ethnicity) ensures all important subgroups are represented and improves the precision of estimates.

  • Clustering: Clustering respondents by area makes data collection logistically practical and cost-efficient.
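The precision gain from stratification can be checked with a small simulation in base R. This is a sketch under stated assumptions: a hypothetical population of 10,000 split into two equal strata, "A" and "B", with very different prevalence (5% and 45%), sampled repeatedly either by simple random sampling or by proportional stratified sampling with the same total sample size.

```r
set.seed(42)

# Hypothetical population: two equal strata with very different prevalence
pop <- data.frame(
  stratum = rep(c("A", "B"), each = 5000),
  y = c(rep(c(1, 0), c(250, 4750)),    # stratum A: 5% prevalence
        rep(c(1, 0), c(2250, 2750)))   # stratum B: 45% prevalence
)
true_prev <- mean(pop$y)               # 0.25

n    <- 200    # total sample size per survey
reps <- 3000   # number of simulated surveys

# Simple random sampling: draw n people from the whole population
srs_est <- replicate(reps, mean(sample(pop$y, n)))

# Proportional stratified sampling: n/2 from each stratum, combined by
# population share (0.5 each)
y_A <- pop$y[pop$stratum == "A"]
y_B <- pop$y[pop$stratum == "B"]
str_est <- replicate(reps,
  0.5 * mean(sample(y_A, n / 2)) + 0.5 * mean(sample(y_B, n / 2)))

# Both designs are centred on the true prevalence;
# the stratified estimates vary less from survey to survey
round(c(true = true_prev, srs_mean = mean(srs_est), str_mean = mean(str_est)), 3)
round(c(srs_sd = sd(srs_est), str_sd = sd(str_est)), 4)
```

The gain depends on how much the strata differ: if prevalence were similar in both strata, the two designs would perform almost identically.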

Example: Diabetes among Malaysian Adults (NHMS 2023)

Category    Overall %  95% CI     Male %  95% CI     Female %  95% CI
Malaysia    15.6       14.4–16.9  15.0    13.6–16.5  16.2      14.7–18.0
Age group (years)
18–29       3.2        2.2–4.6    3.7     2.2–6.1    2.6       1.7–4.1
30–39       6.5        5.2–8.1    6.9     5.0–9.3    6.0       4.5–7.9
40–49       15.2       13.2–17.4  13.7    11.1–16.8  16.8      14.2–19.8
50–59       28.8       25.0–33.0  28.4    24.2–33.0  29.3      24.4–34.7
60+         38.0       35.4–40.7  37.7    34.0–41.5  38.4      35.0–41.8
Ethnicity
Malay       16.2       15.1–17.4  15.5    14.1–17.1  16.9      15.4–18.4
Chinese     15.1       11.6–19.5  14.8    11.2–19.3  15.5      11.0–21.3
Indian      26.4       22.1–31.2  28.4    22.1–35.7  24.5      19.4–30.4
B. Sabah    9.3        7.3–11.8   9.5     6.8–13.0   9.1       6.5–12.6
B. Sarawak  17.2       13.0–22.3  14.9    10.4–21.0  19.3      14.3–25.6
Others      10.2       7.5–13.6   10.0    6.6–14.8   10.6      6.4–17.0

Simulation